METHOD FOR ANALYZING PARTIAL GENE SEQUENCES



FIELD OF THE INVENTION 
This invention relates to a computer-based method 
for building putative gene assemblies from partial gene 
sequences.

BACKGROUND OF THE INVENTION 
The human genome is estimated to contain 3 billion 
base pairs of DNA. Within the genome, it is believed 
that approximately 50,000 to 100,000 gene coding 
sequences are dispersed. The gene sequences are thought 
to represent about 3% or approximately 90 million base 
pairs of the human genome. 

It is generally recognized that elucidation of the 
structure of all human genes and their organization 
within the genome will be beneficial to the advancement 
of medicine and biology. Databases such as the Genome 
Sequence Data Bank and GenBank serve as repositories of 
the nucleotide sequence data generated by ongoing 
research efforts. Despite the efforts to date, GenBank 
lists the sequences of only a few thousand human genes. 

Recent advances in automated, large-scale 
sequencing techniques have led to the initiation of two 
broad approaches to obtaining the sequence of the human 
genome. While the scientific debate continues as to the 
best approach, chromosome mapping and sequencing and 
gene sequencing projects have begun in earnest. 
The Human Genome Initiative, a multinational effort
having government backing in the United States and other
countries, is attempting to characterize the genomes of
humans and other model organisms using a chromosomal
approach. In the private sector, large-scale sequencing
of cDNA reverse transcribed from mRNA expressed in
various human tissues, cell types and developmental
stages is being pursued by a number of entities.

After publication of the Maxam-Gilbert and Sanger
et al. nucleotide sequencing techniques, manual gene
sequence assembly methods were practical for single gene
or viral genome sequencing projects. As sequencing
projects became more ambitious, manual techniques could
be supplemented by computer-assisted sequence and contig
assembly where overlaps between fragments were
identified by software rather than by eye. However, the
large scale of DNA sequencing projects and the rapidity
with which sequence data is generated by automated
sequencer machines has resulted in data analysis
becoming a rate-limiting step in assembly of gene
sequence data. The volume of data being generated by
large-scale sequencing projects requires automated
analysis in order to provide assembled sequence data in
a timely manner.

Towards this end, efforts have been made to improve
computer-assisted assembly of nucleotide sequence data.
For example, in "Automated DNA Sequencing and Analysis",
Adams et al. eds., Academic Press (1995), E.W. Myers
presents a discussion of software systems for fragment
assembly in Chapter 32, while S. Honda et al. describe
in Chapter 33 the Genome Reconstruction Manager, a
long-term software engineering project to develop a
system to support large-scale sequencing efforts.

Despite these efforts, a need exists for
improvements over existing methods. The improved
methods will provide computer-assisted nucleotide
sequence assembly methods capable of more accurately and
more efficiently assembling large amounts of sequence
data.

SUMMARY OF THE INVENTION
Accordingly, one aspect of the present invention is
a computer-based method for analyzing partial gene
sequences. A computer-based iterative method for
building putative gene assemblies from a plurality of
partial gene sequences is provided. The method allows
for the incremental addition of new partial gene
sequences to be integrated with an existing plurality of
putative gene assemblies. The method comprises
preprocessing of the partial gene sequences and existing
putative gene assemblies and assembling, responsive to
grouping relationships, a consensus sequence from the
preprocessed partial gene sequences and putative gene
assemblies. Preprocessing comprises the steps of
annotating regions within each of the plurality of
partial gene sequences and each of the plurality of
existing putative gene assemblies; and grouping
annotated partial gene sequences with other annotated
partial gene sequences, where the other annotated
partial gene sequences include components of the
existing plurality of putative gene assemblies.
Preprocessing allows for efficient and accurate
assembly.

BRIEF DESCRIPTION OF THE DRAWING
The accompanying drawing, which is incorporated in
and constitutes a part of the specification, illustrates
a preferred embodiment of the invention and together
with the description serves to explain the principles of
the invention.

FIG. 1 is a block diagram of a method for analyzing
partial gene sequences.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The method of the invention provides for automated
management of a large and continuously growing
population of partial gene sequences. As used herein,
the term "partial gene sequences" refers to a series of
symbolic codes for nucleotide bases comprising a portion
of a gene, DNA or RNA. Partial gene sequences can be
derived by automated or manual methods well known to
those skilled in the art and can be stored in a
database.

The method is a computer-based iterative process
for building putative gene assemblies from a plurality
of partial gene sequences. As used herein, "putative
gene assembly" is an arrangement of partial gene
sequences aligned relative to one another and combined
to yield a consensus sequence. The iterative nature of
the method of the invention allows for the incremental
addition of new partial gene sequences to be integrated
with an existing plurality of putative gene assemblies
in an efficient manner.
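
The incremental integration described above can be sketched as
follows. This is a minimal illustration only, not the
implementation of the invention: the names Assembly, add_batch,
related and assemble are hypothetical, and the grouping test and
the consensus builder are passed in as placeholder functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(eq=False)  # compare assemblies by identity, not by content
class Assembly:
    members: Dict[str, str]  # sequence name -> partial gene sequence
    consensus: str = ""


def add_batch(
    assemblies: List[Assembly],
    new_seqs: Dict[str, str],
    related: Callable[[str, str], bool],
    assemble: Callable[[Dict[str, str]], str],
) -> List[Assembly]:
    """Integrate new sequences; only the assemblies they touch are rebuilt."""
    for name, seq in new_seqs.items():
        # Existing assemblies that have at least one member related to seq.
        hits = [a for a in assemblies
                if any(related(seq, m) for m in a.members.values())]
        if not hits:
            target = Assembly(members={})
            assemblies.append(target)
        else:
            target = hits[0]
            # A new sequence may bridge several assemblies: merge them.
            for other in hits[1:]:
                target.members.update(other.members)
                assemblies.remove(other)
        target.members[name] = seq
        target.consensus = assemble(target.members)
    return assemblies


# Placeholder helpers (assumptions for the example only).
def shares_4mer(a: str, b: str) -> bool:
    return any(a[i:i + 4] in b for i in range(len(a) - 3))


def longest_member(members: Dict[str, str]) -> str:
    return max(members.values(), key=len)


pool: List[Assembly] = []
add_batch(pool, {"s1": "ACGTTGCA", "s2": "GGGGCCCC"}, shares_4mer, longest_member)
add_batch(pool, {"s3": "TGCAGGGG"}, shares_4mer, longest_member)
```

In this example the first batch yields two separate assemblies,
and the later sequence s3 is related to both, so the two are
merged into one without reprocessing any unrelated assembly.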

Large numbers of partial gene sequences can be
assembled by the method of the invention. The gene
sequence assemblies produced by the method can be stored
in a database and characterized for biological function.
The nucleic acids represented by the gene sequence
assemblies and the proteins that the nucleic acids encode
are useful as drug discovery reagents and/or biomedical
research tools.

As shown in FIG. 1, the method of the invention
broadly comprises three steps of annotation, grouping
and assembly. Efficient and accurate assembly of
partial gene sequences is achieved through the assembly
pre-processing steps of annotating and grouping and the
use of the plurality of existing putative gene
assemblies. The increased efficiency of the present
method allows for high throughput of partial gene
sequences.

Annotating is a process of identifying regions of
partial gene sequences and putative gene assemblies that
may cause two unlike sequences to be considered alike or
otherwise produce inaccurate results in the grouping or
assembly processes. These regions are likely to
interfere with the correctness of the subsequent
grouping and assembly steps of the method of the
invention. The remaining unidentified regions are
considered to contain useful information (for the
purpose of grouping and assembly) and are used in the
subsequent grouping and assembly steps. Regions
identified as likely to interfere with subsequent steps
are ignored in those steps.

Examples of regions which can be identified in the
annotating step are sequences from species other than
the one of interest and nucleic acids or DNA from
cellular structures such as ribosomes and mitochondria.
Low information regions which occur multiple times in a
sequence such as polynucleotide runs, simple tandem
repeats (STRs) and genomic repetitive sequences, such as
ALU, can also be identified. Further, ambiguous regions
and regions resulting from experimental error or
artifacts are also identified.
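
As an illustration of the annotating step, the sketch below
marks three of the region types named above (polynucleotide
runs, simple tandem repeats and ambiguous base calls) and masks
them so that later steps can ignore them. The function names,
the thresholds min_run and min_repeats, the "x" mask character
and the use of "N" for an ambiguous base are assumptions made
for this example; they are not specified in the description.

```python
import re
from typing import List, Tuple

Region = Tuple[int, int, str]  # (start, end_exclusive, label)


def annotate_low_information(seq: str, min_run: int = 8,
                             min_repeats: int = 4) -> List[Region]:
    """Return regions that look like runs, short tandem repeats or N calls."""
    seq = seq.upper()
    regions: List[Region] = []
    # Polynucleotide run: one base repeated at least min_run times.
    for m in re.finditer(r"([ACGT])\1{%d,}" % (min_run - 1), seq):
        regions.append((m.start(), m.end(), "polynucleotide_run"))
    # Simple tandem repeat: a 2-6 base unit repeated at least min_repeats times.
    for m in re.finditer(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1), seq):
        if len(set(m.group(1))) > 1:  # skip units that are a single base
            regions.append((m.start(), m.end(), "simple_tandem_repeat"))
    # Ambiguous base calls.
    for m in re.finditer(r"N+", seq):
        regions.append((m.start(), m.end(), "ambiguous"))
    return sorted(regions)


def mask_regions(seq: str, regions: List[Region]) -> str:
    """Replace every annotated position with 'x' so later steps skip it."""
    chars = list(seq)
    for start, end, _label in regions:
        chars[start:end] = "x" * (end - start)
    return "".join(chars)


example = "GGTC" + "A" * 10 + "TGCC" + "CA" * 5 + "GGTC"
found = annotate_low_information(example)
masked = mask_regions(example, found)
```

Sequences from other species, ribosomal or mitochondrial
sequences and genomic repeats such as ALU would need comparison
against reference collections and are not covered by this
sketch.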

After annotation, the annotated partial gene
sequences are grouped with other annotated partial gene
sequences. The step of grouping the annotated partial
gene sequences is based on determining association
relationships between an annotated partial gene sequence
and other existing annotated partial gene sequences,
some of which may be components of previously identified
putative gene assemblies. This process begins by
ignoring the annotated regions from the partial gene
sequences and previously identified putative gene
assemblies. The partial gene sequences, with the
annotated regions ignored, are then compared with the
consensus sequence of previously identified putative
gene assemblies, with the annotated regions ignored.
The partial gene sequences are also compared with each
other, ignoring the annotated regions. The partial
gene sequences are placed in groups based on the
similarities found in these comparisons. Resulting
groups thereby contain a collection of partial gene
sequences that would appear to belong together, i.e.,
the grouping step produces a group of partial gene
sequences that are thought to assemble together.
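
One way to picture the grouping step is the sketch below, which
places two sequences in the same group when they share enough
short words (k-mers) outside the annotated regions, and joins
groups transitively with a union-find structure. The similarity
measure, the parameters k and min_shared, the function names and
the "x" mask character are assumptions for illustration; the
description does not prescribe a particular comparison method.
Consensus sequences of existing assemblies could be passed in
the same dictionary as the new partial gene sequences.

```python
from collections import defaultdict
from typing import Dict, List, Set


def unmasked_kmers(seq: str, k: int) -> Set[str]:
    """All length-k words that contain no masked ('x') position."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)
            if "x" not in seq[i:i + k]}


def group_sequences(seqs: Dict[str, str], k: int = 8,
                    min_shared: int = 3) -> List[Set[str]]:
    """Group sequence names that share at least min_shared unmasked k-mers."""
    parent = {name: name for name in seqs}

    def find(a: str) -> str:
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # Index: k-mer -> names of the sequences that contain it.
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in seqs.items():
        for word in unmasked_kmers(seq, k):
            index[word].add(name)

    # Count shared k-mers for every pair that shares at least one.
    shared: Dict[tuple, int] = defaultdict(int)
    for names in index.values():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = defaultdict(set)
    for name in seqs:
        groups[find(name)].add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


reads = {
    "est1": "ACGTTGCAAGGCT",
    "est2": "TGCAAGGCTTACCG",
    "est3": "GATCxxxxxxxxCTAG",  # masked run in the middle
    "est4": "TTCCxxxxxxxxGGAA",  # same masked run, otherwise unrelated
}
groups = group_sequences(reads, k=4, min_shared=2)
```

Here est3 and est4 are not grouped, because the only thing they
have in common is a masked region. The sketch compares one
strand only; reverse-complement matches are not handled.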

For each group from the previous step, the
positional ordering of the partial gene sequences
relative to one another is determined for the group as
a whole, on the assumption that all partial gene
sequences in the group belong to the same putative gene
assembly. One consequence of the ordering is that more
than one putative gene assembly may result, should the
ordering step uncover inconsistencies among the group
of partial gene sequences.
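
The ordering step, including the case where one group yields
more than one putative gene assembly, can be sketched as
follows. This is a simplified illustration under stated
assumptions: overlaps must match exactly, each sequence is
placed against a single already-placed neighbour, and the names
overlap_offset, order_group and min_overlap are hypothetical.
Sequences that cannot be placed start a further assembly.

```python
from typing import Dict, List, Optional


def overlap_offset(a: str, b: str, min_overlap: int) -> Optional[int]:
    """Offset of b relative to a for the longest exact overlap, else None."""
    best = None  # (offset, overlap_length)
    for off in range(-(len(b) - min_overlap), len(a) - min_overlap + 1):
        a_start, b_start = max(0, off), max(0, -off)
        length = min(len(a) - a_start, len(b) - b_start)
        if (length >= min_overlap
                and a[a_start:a_start + length] == b[b_start:b_start + length]):
            if best is None or length > best[1]:
                best = (off, length)
    return None if best is None else best[0]


def order_group(seqs: Dict[str, str], min_overlap: int = 5) -> List[Dict[str, int]]:
    """Return one {name: start_position} layout per resulting assembly."""
    remaining = dict(seqs)
    assemblies: List[Dict[str, int]] = []
    while remaining:
        seed = next(iter(remaining))
        placed = {seed: 0}
        del remaining[seed]
        progress = True
        while progress:
            progress = False
            for name in list(remaining):
                for anchor, anchor_off in list(placed.items()):
                    off = overlap_offset(seqs[anchor], seqs[name], min_overlap)
                    if off is not None:
                        placed[name] = anchor_off + off
                        del remaining[name]
                        progress = True
                        break
        shift = min(placed.values())  # leftmost sequence starts at 0
        assemblies.append({n: o - shift for n, o in placed.items()})
    return assemblies


group = {
    "a": "ACGTTGCAAGGCT",
    "b": "TGCAAGGCTTACCG",
    "c": "GGCTTACCGATTA",
    "d": "CCCCCGGGGGTTTTT",  # does not overlap the others
}
layouts = order_group(group)
```

In this example a, b and c are laid out at positions 0, 4 and 9,
while d, which overlaps none of them, ends up in a second
assembly of its own.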

Once positional ordering has been completed for
each putative gene assembly, a consensus sequence can be
generated by any of a variety of contig assembly
programs known to those of ordinary skill in the art.
An exemplary program is GELMERGE, available from
Genetics Computer Group, Inc. in Madison, WI.
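
To make the notion of a consensus sequence concrete, the sketch
below takes sequences with known start positions and reports the
most frequent base in each column. It is a deliberately simple,
ungapped majority vote written for illustration; it is not how
GELMERGE or any other contig assembly program works, and the
function name and the use of "N" for ties or uncovered columns
are assumptions.

```python
from collections import Counter
from typing import Dict


def majority_consensus(seqs: Dict[str, str], offsets: Dict[str, int]) -> str:
    """Column-wise majority vote over sequences placed at given offsets."""
    length = max(offsets[name] + len(seq) for name, seq in seqs.items())
    columns = [Counter() for _ in range(length)]
    for name, seq in seqs.items():
        for i, base in enumerate(seq):
            columns[offsets[name] + i][base] += 1

    out = []
    for col in columns:
        if not col:
            out.append("N")  # no sequence covers this column
            continue
        (top, top_count), *rest = col.most_common(2)
        # Report 'N' when the two most frequent bases are tied.
        out.append("N" if rest and rest[0][1] == top_count else top)
    return "".join(out)


placed_seqs = {"a": "ACGTTGCA", "b": "TTGGAAGG", "c": "GTTGCAAG"}
placed_at = {"a": 0, "b": 3, "c": 2}
result = majority_consensus(placed_seqs, placed_at)
```

In the example, sequence b disagrees with a and c at one
position and is outvoted there.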

The method of the invention is computer-based.
Accordingly, partial gene sequences, annotated partial
gene sequences, grouped annotated partial gene sequences
and assembled consensus sequences are embodied as
signals in a computer while being processed by the
method of the invention.

Upon completion of the annotating, grouping, and
assembling steps, the putative gene assemblies are
stored in a database. Putative gene assemblies may be
characterized on the basis of their sequence, structure,
biological function or other related characteristics.
Once the assemblies are categorized, the database can be
expanded with information linked to the putative gene
assemblies regarding their potential biological
function, structure or other characteristics.

For example, one method of characterizing putative
gene assemblies is by homology to other known genes.
Shared homology of a putative gene assembly with a known
gene may indicate a similar biological role or function.
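
As a rough illustration of characterization by homology, the
sketch below scores a consensus sequence against a small set of
known sequences using a Smith-Waterman local alignment score and
reports the best hit above a threshold. The scoring values, the
threshold, the function names and the example sequences are
assumptions chosen for the example; the description does not
specify which comparison method is used, and a production system
would more likely rely on an established database search tool.

```python
from typing import Dict, Optional, Tuple


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def best_known_match(consensus: str, known: Dict[str, str],
                     min_score: int) -> Optional[Tuple[str, int]]:
    """Name and score of the most similar known sequence, or None."""
    scored = [(smith_waterman_score(consensus, seq), name)
              for name, seq in known.items()]
    score, name = max(scored)
    return (name, score) if score >= min_score else None


known_genes = {  # hypothetical reference sequences
    "geneA": "TTACGTTGCAAGGTT",
    "geneB": "CCCCCCCCCC",
}
hit = best_known_match("ACGTTGCAAGG", known_genes, min_score=10)
```

A hit only suggests a possible relationship; as stated above,
shared homology may indicate, but does not establish, a similar
biological role.
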
Another exemplary method of characterizing putative
gene assemblies is on the basis of known sequence
motifs. Certain sequence patterns are known to code for
regions of proteins having specific biological
characteristics such as signal sequences, transmembrane
domains, SH2 domains, etc.

In addition to the methods just discussed, which
can be automated, genes may also be characterized on the
basis of expert commentary from relevant human
specialists for given genes or by the results of
biological experiments.

It will be apparent to those skilled in the art
that various modifications can be made to the present
method for analyzing partial gene sequences without
departing from the scope or spirit of the invention, and
it is intended that the present invention cover
modifications and variations of the method for analyzing
partial gene sequences provided they come within the
scope of the appended claims and their equivalents.

CLAIMS

1. A computer-based iterative method for building
putative gene assemblies from a plurality of partial
gene sequences comprising the steps of:

(a) adding incrementally new partial gene
sequences to be integrated with an existing plurality of
putative gene assemblies;

(b) preprocessing the partial gene sequences
and existing putative gene assemblies; and

(c) assembling a consensus sequence from the
preprocessed partial gene sequences and putative gene
assemblies.

2. The method of claim 1 wherein the preprocessing
step comprises the steps of:

(1) annotating regions within each of the
plurality of partial gene sequences and each of the
plurality of existing putative gene assemblies; and

(2) grouping annotated partial gene sequences
with other annotated partial gene sequences, wherein
some of the other annotated partial gene sequences may
be components of existing putative gene assemblies.

3. The method of claim 1 further comprising the
step of:

(d) characterizing the consensus sequence.

4. The method of claim 3 wherein the
characterization of the consensus sequence is on the
basis of homology to known sequences.

5. The method of claim 3 wherein the
characterization of the consensus sequence is on the
basis of similarities to known sequence motifs.

6. A computer-based iterative method for building
putative gene assemblies from a plurality of partial
gene sequences comprising the steps of:

(a) adding incrementally new partial gene
sequences to be integrated with an existing plurality of
putative gene assemblies;

(b) annotating regions within each of the
plurality of partial gene sequences and each of the
plurality of existing putative gene assemblies;

(c) grouping annotated partial gene sequences
with other annotated partial gene sequences, wherein
some of the other annotated partial gene sequences may
be components of existing putative gene assemblies; and

(d) assembling, responsive to grouping
relationships, a consensus sequence from the grouped
annotated partial gene sequences.

7. The method of claim 6 further comprising the
step of:

(e) characterizing the consensus sequence.

8. The method of claim 7 wherein the
characterization of the consensus sequence is on the
basis of homology to known sequences.

9. The method of claim 7 wherein the
characterization of the consensus sequence is on the
basis of similarities to known sequence motifs.

10. A computer-based iterative method for building
putative gene assemblies from a plurality of partial
gene sequences comprising the steps of:

(a) adding incrementally new partial gene
sequences to be integrated with an existing plurality of
putative gene assemblies;

(b) annotating regions within each of the
plurality of partial gene sequences and each of the
plurality of existing putative gene assemblies;

(c) grouping annotated partial gene sequences
with other annotated partial gene sequences, wherein
some of the other annotated partial gene sequences may
be components of existing putative gene assemblies;

(d) assembling, responsive to grouping
relationships, a consensus sequence from the grouped
annotated partial gene sequences; and

(e) characterizing the consensus sequence.